{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 11 - Logistic Regression Continued\n", "\n", "The Akimel O'odham people, also known as the Pima Indians since the European colonization of the US, currently have a high prevalence of diabetes. The Pima Indian Diabetes dataset, which records several possible diabetes indicators along with whether each person has diabetes, is on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) or available [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/diabetes.csv).\n", "\n", "Load the dataset into a dataframe called `diabetes`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "import statsmodels.formula.api as smf\n", "from scipy.special import expit\n", "from scipy.stats import logistic\n", "\n", "%matplotlib inline\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 " <thead>\n",
 " <tr><th></th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th><th>Outcome</th></tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
 " <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
 " <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
 " <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
 " <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
 " </tbody>\n",
 "</table>
" ], "text/plain": [ " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", "0 6 148 72 35 0 33.6 \n", "1 1 85 66 29 0 26.6 \n", "2 8 183 64 0 0 23.3 \n", "3 1 89 66 23 94 28.1 \n", "4 0 137 40 35 168 43.1 \n", "\n", " DiabetesPedigreeFunction Age Outcome \n", "0 0.627 50 1 \n", "1 0.351 31 0 \n", "2 0.672 32 1 \n", "3 0.167 21 0 \n", "4 2.288 33 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes = pd.read_csv(\"diabetes.csv\")\n", "diabetes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a scatter plot of glucose vs. diabetes outcome." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "diabetes.plot.scatter(x = \"Glucose\", y = \"Outcome\", alpha = 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit a logistic regression model to this data, using Glucose as the independent variable and Outcome as the dependent variable." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.526510\n", " Iterations 6\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
<table>\n", "<caption>Logit Regression Results</caption>\n",
 "<tr><th>Dep. Variable:</th><td>Outcome</td><th>No. Observations:</th><td>768</td></tr>\n",
 "<tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>766</td></tr>\n",
 "<tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>1</td></tr>\n",
 "<tr><th>Date:</th><td>Thu, 10 Oct 2019</td><th>Pseudo R-squ.:</th><td>0.1860</td></tr>\n",
 "<tr><th>Time:</th><td>17:40:58</td><th>Log-Likelihood:</th><td>-404.36</td></tr>\n",
 "<tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-496.74</td></tr>\n",
 "<tr><th></th><td></td><th>LLR p-value:</th><td>4.418e-42</td></tr>\n",
 "</table>\n",
 "<table>\n",
 "<tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P>|z|</th><th>[0.025</th><th>0.975]</th></tr>\n",
 "<tr><th>Intercept</th><td>-5.3501</td><td>0.421</td><td>-12.713</td><td>0.000</td><td>-6.175</td><td>-4.525</td></tr>\n",
 "<tr><th>Glucose</th><td>0.0379</td><td>0.003</td><td>11.647</td><td>0.000</td><td>0.031</td><td>0.044</td></tr>\n",
 "</table>" ], "text/plain": [ "\n", "\"\"\"\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: Outcome No. Observations: 768\n", "Model: Logit Df Residuals: 766\n", "Method: MLE Df Model: 1\n", "Date: Thu, 10 Oct 2019 Pseudo R-squ.: 0.1860\n", "Time: 17:40:58 Log-Likelihood: -404.36\n", "converged: True LL-Null: -496.74\n", " LLR p-value: 4.418e-42\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept -5.3501 0.421 -12.713 0.000 -6.175 -4.525\n", "Glucose 0.0379 0.003 11.647 0.000 0.031 0.044\n", "==============================================================================\n", "\"\"\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_model = smf.logit(\"Outcome ~ Glucose\", diabetes).fit()\n", "logit_model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the equation of the logistic regression model?\n", "\n", "$$y = \\frac{1}{1 + e^{-(-5.3501 + 0.0379x)}}$$\n", "\n", "Here $x$ is the glucose value and $y$ is the predicted probability of diabetes.\n", "\n", "We can also plot the logistic regression model using Seaborn's `regplot()`. Use `regplot()` as if you were doing linear regression on the variables, but add in the parameter `logistic = True`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.regplot(x = \"Glucose\", y = \"Outcome\", data = diabetes, logistic = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's assess this model by computing the confusion matrix. In the matrix returned by `pred_table()`, the rows are the actual outcomes (0, then 1) and the columns are the predicted outcomes (0, then 1):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[443., 57.],\n", " [138., 130.]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix = logit_model.pred_table()\n", "confusion_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What type of errors are most likely?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the sensitivity and specificity of our model:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sensitivity: 0.48507462686567165\n", "Specificity: 0.886\n" ] } ], "source": [ "# pred_table()[i][j] counts the cases with actual outcome i and predicted outcome j\n", "true_neg = confusion_matrix[0][0]\n", "false_pos = confusion_matrix[0][1]\n", "false_neg = confusion_matrix[1][0]\n", "true_pos = confusion_matrix[1][1]\n", "sensitivity = true_pos/(true_pos + false_neg)\n", "specificity = true_neg/(true_neg + false_pos)\n", "\n", "print(\"Sensitivity:\",sensitivity)\n", "print(\"Specificity:\",specificity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the sensitivity and specificity, what are the strong and weak points of the model?\n", "\n", "Remember that by default the confusion matrix uses 0.5 as the cut-off for whether a predicted y value is interpreted as 0 or 1. We can change that by passing in the new cut-off as a parameter. For example, to interpret values >= 0.7 as 1 and < 0.7 as 0, use the code:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[484., 16.],\n", " [195., 73.]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix = logit_model.pred_table(0.7)\n", "confusion_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How did the confusion matrix change? Do you think this new cut-off is better or worse? Recompute the sensitivity and specificity." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sensitivity: 0.27238805970149255\n", "Specificity: 0.968\n" ] } ], "source": [ "# pred_table()[i][j] counts the cases with actual outcome i and predicted outcome j\n", "true_neg = confusion_matrix[0][0]\n", "false_pos = confusion_matrix[0][1]\n", "false_neg = confusion_matrix[1][0]\n", "true_pos = confusion_matrix[1][1]\n", "sensitivity = true_pos/(true_pos + false_neg)\n", "specificity = true_neg/(true_neg + false_pos)\n", "\n", "print(\"Sensitivity:\",sensitivity)\n", "print(\"Specificity:\",specificity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How did the sensitivity and specificity change? Are they better or worse?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression with Multiple Independent Variables\n", "\n", "Let's compute a logistic regression model using all of the other columns as independent variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do any of the variables have high p-values? If so, let's create a new logistic regression model without those columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the equation for this logistic regression model?\n", "\n", "As before, let's compute the confusion matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think this is an improvement over the model based only on glucose?\n", "\n", "Compute the sensitivity and specificity:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do the sensitivity and specificity compare to the simpler model? Try different cut-offs to see which value gives the best results.\n", "\n", "### Challenges\n", "- To better understand the data, plot the distributions of the Glucose column for people with diabetes and without diabetes as overlapping histograms. How does this graph compare to the scatterplot of glucose vs. outcome? Which gives more information?\n", "- (Very challenging) A *Receiver Operating Characteristic curve* or *ROC curve* gives information about the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the cut-off changes. To plot such a curve, use a loop to compute the true positive rate and false positive rate for multiple cut-offs (e.g., 0.1, 0.2, ..., 0.8, 0.9), and draw a line plot of these values with the false positive rate on the x-axis and the true positive rate on the y-axis." 
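, "\n", "\n", "The loop described in the ROC challenge above can be sketched in plain Python as follows. This is only a minimal illustration, not the lab's reference solution: the helper names `confusion_counts` and `roc_points` are hypothetical, the probabilities and outcomes are made up rather than taken from the diabetes data, and the sketch assumes both classes are present so that neither rate divides by zero.

```python
def confusion_counts(probabilities, labels, cutoff):
    # Count outcomes, treating a probability >= cutoff as a prediction of 1
    # (the same convention the lab text uses for the 0.7 cut-off).
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(probabilities, labels, cutoffs):
    # One (cutoff, false positive rate, true positive rate) triple per cut-off.
    points = []
    for cutoff in cutoffs:
        tp, fp, tn, fn = confusion_counts(probabilities, labels, cutoff)
        tpr = tp / (tp + fn)  # sensitivity
        fpr = fp / (fp + tn)  # 1 - specificity
        points.append((cutoff, fpr, tpr))
    return points


# Made-up predicted probabilities and true outcomes, for illustration only.
example_probs = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]
example_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoffs = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

points = roc_points(example_probs, example_labels, cutoffs)
for cutoff, fpr, tpr in points:
    print('cut-off', cutoff, 'FPR', fpr, 'TPR', tpr)
```

In the notebook itself, the predicted probabilities from `logit_model.predict()` and the `Outcome` column of `diabetes` would presumably take the place of the made-up lists, and the false and true positive rates could then be passed to `plt.plot` as the x and y values."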
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.8" } }, "nbformat": 4, "nbformat_minor": 2 }